Facilitate SIMD-Code-Generation in the Polyhedral Model by Hardware-aware Automatic Code-Transformation
Authors
Abstract
Although Single Instruction Multiple Data (SIMD) units have been available in general-purpose processors since the 1990s, state-of-the-art compilers are often still unable to exploit them fully, i.e., they may fail to achieve the best possible performance. We present a new hardware-aware and adaptive loop tiling approach that is based on polyhedral transformations and explicitly dedicated to improving auto-vectorization. It extends the tiling algorithm implemented within the PluTo framework [4, 5]. In its default setting, PluTo uses static tile sizes and already enables the use of SIMD units, but it is not primarily targeted at optimizing that use. We experimented with different tile sizes and found a strong relationship between their choice, cache size parameters, and performance. Based on this, we designed an adaptive procedure that specifically tiles vectorizable loops with dynamically calculated sizes. The blocking is automatically fitted to the amount of data read per loop iteration, the available SIMD units, and the cache sizes. The adaptive parts are built upon straightforward calculations that are experimentally verified and evaluated. Our results show significant improvements in the number of vectorized instructions, in cache miss rates, and, ultimately, in running times.
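The abstract does not give the exact formula, but a minimal sketch in C of the kind of tile-size calculation it describes could look as follows; the function name, parameters, and rounding rule are assumptions chosen for illustration, not the authors' algorithm:

#include <stddef.h>

/* Hypothetical sketch (not the paper's actual procedure): pick a tile size
 * for a vectorizable loop so that the data touched by one tile fits into a
 * given cache level while the tile stays a multiple of the SIMD width. */
static size_t adaptive_tile_size(size_t cache_bytes,       /* capacity of the targeted cache level */
                                 size_t bytes_per_iter,    /* data read/written per loop iteration */
                                 size_t simd_width_elems)  /* elements held in one SIMD register   */
{
    /* Largest iteration count whose working set still fits in the cache. */
    size_t max_iters = cache_bytes / bytes_per_iter;

    /* Round down to a multiple of the SIMD width so each tile vectorizes
     * without a scalar remainder loop inside the tile. */
    size_t tile = (max_iters / simd_width_elems) * simd_width_elems;

    /* Fall back to one full vector if the per-iteration data is very large. */
    return tile > 0 ? tile : simd_width_elems;
}

For example, with a 32 KiB L1 data cache, 24 bytes of data per iteration (two doubles read, one written), and 4 doubles per AVX register, this sketch yields a tile of 1364 iterations; the calculation used in the paper may differ.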
Similar resources
Automatic Code Generation for SIMD Hardware Accelerators
SIMD hardware accelerators offer an alternative to manycores when energy consumption and performance are critical. For scientific computing, GPGPUs are used in many of the TOP500 computers, but embedded processors also use accelerators. However, such heterogeneous platforms trade ease of development for performance: the application code and the data must be split between the host and the accel...
PIPS Is not (just) Polyhedral Software: Adding GPU Code Generation in PIPS
Parallel and heterogeneous computing are reaching a growing audience thanks to the increased performance brought by ubiquitous manycores and GPUs. However, available programming models, like OpenCL or CUDA, are far from being straightforward to use. As a consequence, several automated or semi-automated approaches have been proposed to automatically generate hardware-level codes from high-level sequenti...
Tobias Christian Grosser, Diploma thesis
Sustained growth in high performance computing and the availability of advanced mobile devices increase the use of computation-intensive applications. To ensure fast execution and, consequently, low power usage, modern hardware provides multi-level caches, multiple cores, SIMD instructions, or even dedicated vector accelerators. Taking advantage of these manually is difficult and often not possible...
Automatic Transformations for Effective Parallel Execution on Intel Many Integrated Core
We demonstrate in this work the potential effectiveness of a source-to-source framework for automatically optimizing a sub-class of affine programs on the Intel Many Integrated Core Architecture. Data locality is achieved through complex and automated loop transformations within the polyhedral framework to enable parallel tiling, and the resulting tiles are processed by an aggressive automatic ...
Efficient Code Generation for Automatic Parallelization and Optimization
Supercompilers look for the best execution order of the statement instances in the most compute-intensive kernels. It has been extensively shown that the polyhedral model provides convenient abstractions to find and perform useful program transformations. Nevertheless, current polyhedral code generation algorithms lack flexibility by addressing mainly unimodular or at least invertibl...
Journal:
Volume / Issue:
Pages: -
Publication year: 2012